Contact:
Peter-Paul de Wolf
Statistics Netherlands
P.O. Box 24500
2490 HA The Hague
The Netherlands
Phone: +31 70 337 5060
Last update: 10 Oct 2011
OR-tools for tabular data (WP 4.1)
Leading partner: CBS
Participating partners: ULL, UPC
Task 1 (Responsibility: University of La Laguna)
Objectives
We aim to develop, analyse and implement algorithms for solving some optimisation
problems arising when protecting sensitive information in tabular data.
In particular, this task addresses the application of the cell suppression
methodology to hierarchical and linked tables.
Description of work
We will study several algebraic structures to capture the links between cells,
both in hierarchical and linked tables. This is a very relevant part, as we will
use Integer Programming to model the optimisation problems and then techniques
based on Linear Programming to solve them. The models need to capture the
structure of the tables in such a way that the gap between the linear programming
relaxation and the integer program is not too large. Otherwise, additional
constraints must be derived through polyhedral analysis to strengthen the linear
relaxation. Polynomially solvable cases and network relaxations must also be
studied for a better understanding of the problem.
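To fix ideas, a standard way of formulating the cell suppression problem as an
integer program (in the spirit of Fischetti and Salazar) can be sketched as
follows; the notation (cell values a_i, suppression costs w_i, table equations
A y = b, big-M constants M_i, upper protection levels UPL_s) is introduced here
only for illustration and is not fixed by this document:

    \begin{aligned}
    \min\ & \textstyle\sum_i w_i x_i \\
    \text{s.t. } & \text{for each sensitive cell } s \text{ there exists } y^s \text{ with} \\
    & A y^s = b,\quad |y^s_i - a_i| \le M_i x_i \ \ \forall i,\quad y^s_s \ge a_s + \mathrm{UPL}_s, \\
    & x_i \in \{0,1\} \ \ \forall i,\quad x_s = 1 \text{ for each sensitive cell } s,
    \end{aligned}

with lower protection levels handled symmetrically. The implicit existential
quantifier is what makes the model hard: the plain linear relaxation can be weak,
which is precisely why the polyhedral strengthening mentioned above matters.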
In this workpackage a junior programmer will help to organise our initial
instances from the CASC partners in a database and to compute statistics on them,
for a better understanding of the instances. It is also important at this phase to
implement pre-processing software that reduces the size of the instances as far as
possible (fixing variables, removing links). We will also set up a web page to
keep our partners and the Operations Research community informed about the
progress of our work in the CASC project. Some meetings with practitioner partners
will be needed for a better understanding of the structure of linked and
hierarchical tables, and for resolving questions that arise while modelling the
optimisation problems.
As the work is mainly oriented towards producing a software optimiser for
real-world instances, we will need some hardware (personal computers and printers)
and some software (compilers, memory checkers, and LP solvers). This material will
also be used intensively throughout the life of the project.
Mathematical Programming offers several general frameworks for ILP optimisation,
including the branch-and-bound approach, cutting-plane procedures, Lagrangean
relaxations, meta-heuristic algorithms, dynamic programming, etc. As a general
rule, the particular features of the model at hand require special adaptations of
these approaches to obtain solutions within acceptable computing times, even on
medium-size instances. Sometimes the best results are obtained by cleverly
combining different general procedures, such as branch-and-bound and cutting-plane
generation. Cell suppression on linked and hierarchical tables is known to be
very difficult from the computational point of view (NP-hard in the strong sense).
The experience from the past SDC project shows, however, that sophisticated
methods can be successful in most cases, at least for 2- and 3-dimensional tables.
The kernel of the algorithm developed there for finding an optimal solution is an
implicit enumeration scheme, generally called branch-and-bound. Pruning of the
choices that need not be explored explicitly is obtained by solving a sequence of
tighter and tighter relaxations of the initial ILP model.
For hierarchical and linked tables we will investigate an approach based on similar
ideas, but taking into account the increased difficulty coming from the new structures
to be handled.
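As a point of reference, the skeleton of such an implicit enumeration scheme can
be sketched as follows. This is a minimal illustration for a generic 0/1
minimisation problem, not the actual cell suppression code; it assumes SciPy's LP
solver and prunes a node whenever its relaxation bound cannot improve on the
incumbent:

    import numpy as np
    from scipy.optimize import linprog

    def branch_and_bound(c, A_ub, b_ub, eps=1e-6):
        """Minimise c'x s.t. A_ub x <= b_ub, x binary (illustrative sketch)."""
        best_val, best_x = np.inf, None
        stack = [[(0.0, 1.0)] * len(c)]      # root node: all variables free in [0, 1]
        while stack:
            bounds = stack.pop()
            res = linprog(c, A_ub=A_ub, b_ub=b_ub, bounds=bounds, method="highs")
            if not res.success or res.fun >= best_val:
                continue                     # node infeasible, or pruned by its LP bound
            frac = [i for i, v in enumerate(res.x) if eps < v < 1.0 - eps]
            if not frac:                     # LP solution already integral: new incumbent
                best_val, best_x = res.fun, res.x.round()
                continue
            i = frac[0]                      # branch by fixing a fractional variable
            for fix in ((0.0, 0.0), (1.0, 1.0)):
                child = list(bounds)
                child[i] = fix
                stack.append(child)
        return best_val, best_x

In the real code the single LP relaxation at each node is replaced by the sequence
of progressively tightened relaxations described above.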
Even though we are concerned with finding optimal solutions to the problems at
hand, near-optimal solutions are necessary to guide and reduce the computational
effort spent within the implicit enumeration scheme. To this end, two kinds of
heuristic algorithms will also be addressed: one for obtaining an initial feasible
solution, and the other for providing near-optimal feasible solutions at each node
of the branch-decision tree. In particular, this second heuristic algorithm will
be invoked many times during the overall run, hence it must be very efficient.
Heuristics embedded within the enumeration scheme are also important because they
allow the user to stop the exact algorithm before convergence, the output in this
case being a (typically extremely tight) approximate solution together with an
indication of its distance from the optimum.
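A typical ingredient of the second kind of heuristic is rounding the fractional LP
point available at each node. A loose sketch (the audit check is supplied by the
caller, since it belongs to the real code):

    def lp_rounding_heuristic(x_lp, cost, is_protected):
        """Round an LP point x_lp (dict: cell -> fractional value) to a
        suppression pattern; is_protected(pattern) is the caller's audit check."""
        pattern = {i for i, v in x_lp.items() if v >= 0.5}
        # Greedily add the cheapest remaining cells until the pattern is protective.
        for i in sorted(x_lp, key=lambda i: cost[i]):
            if is_protected(pattern):
                return pattern
            pattern.add(i)
        # May still be unprotective here: see the infeasible subproblems below.
        return pattern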
Also very important for the success of the algorithm on large instances is the
pre-processing phase, aimed at reducing the size of the input instance that needs
to be considered explicitly. For example, when applying cell suppression it often
happens that some sensitive cells are already protected by the other primary
suppressions, hence they actually need no additional secondary suppressions.
Moreover, when an instance corresponds to a subproblem coming out of the
decomposition of a bigger instance (say, a linked table), it may even be
infeasible, i.e., the protection levels may be set up so that no feasible
suppression pattern exists. This situation creates serious problems, because it is
then difficult even to find approximate solutions, and it is a novel case not
studied in previous work. We will also address these challenging instances.
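The audit computation behind both remarks (detecting primaries that are already
protected, and detecting protection levels that no pattern can meet) can be
sketched with two LPs per sensitive cell, computing the narrowest and widest
values an attacker could deduce for it from the published cells. The matrix
notation and the non-negativity of cell values are assumptions of this sketch:

    import numpy as np
    from scipy.optimize import linprog

    def attacker_range(A_eq, b_eq, values, suppressed, s):
        """Interval an attacker can deduce for cell s, given the table
        equations A_eq y = b_eq and the set of suppressed cell indices."""
        n = len(values)
        bounds = [(0.0, None) if i in suppressed else (values[i], values[i])
                  for i in range(n)]         # published cells are fixed to their value
        c = np.zeros(n)
        c[s] = 1.0
        lo = linprog(c, A_eq=A_eq, b_eq=b_eq, bounds=bounds, method="highs")
        hi = linprog(-c, A_eq=A_eq, b_eq=b_eq, bounds=bounds, method="highs")
        if not (lo.success and hi.success):
            return None                      # unbounded or inconsistent input data
        return lo.fun, -hi.fun               # attacker's interval [min, max] for cell s

If the interval already covers the required protection interval around the true
value, cell s needs no secondary suppressions; and if even the pattern that
suppresses every eligible cell fails this test, the subproblem admits no feasible
pattern at all.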
Because the arising optimisation problems will surely be very difficult to solve,
implementing the theoretical ideas properly and making them work will be
time-consuming. Some meetings of the junior and/or senior researchers with the
tester partners will be needed to fix the final difficulties in the software. At
month 30 the callable library must be embedded in τ-ARGUS, one of the overall
deliverables of the CASC project, and all newly arising difficulties will be
solved by month 35. In the meantime we will write and distribute scientific papers
containing the main ingredients of the developed methodologies.
The complete portability of the code to different computers with different
operating systems (and C compilers) will be ensured. The programming work will be
done on compatible PCs and will be extensively tested for programming errors using
professional debuggers and memory-checking software (Purify). Linear programming
relaxations of the ILP models will be solved using an external, robust
(commercial) LP solver. The performance of the final software depends strongly on
the data structures adopted to store the input table and the relevant information
internally. Alternative data structures will be proposed and compared
computationally on the SDC-lib instances collected from our partners; these
instances will also be used for code testing during all the programming phases.
The closer these instances are to real-world instances, the better our software
will perform. Therefore this work strongly depends on appropriate tables (of
small, medium, and large size) from our partners.
During the last part of our work we will listen to and incorporate the
corrections and suggestions offered by the partners testing our software.
Nevertheless, the main help from our partners will come early in the project, in
the form of real data. The software will aim to solve medium-size instances to
optimality in reasonable computing time on personal computers, although a precise
definition of "medium-size" cannot be given at the moment. From our experience we
know that the complexity of exact methods grows drastically with the dimension of
the table, so tables with cells in more than four dimensions tend to be really
difficult for exact approaches. We expect that the structure of real-world
instances (in general much easier than randomly generated instances) will be
captured by our approaches, so that our software can solve bigger tables.
Milestones and expected results
We expect to model mathematically the optimisation problems arising in
Statistical Disclosure Control on hierarchical and linked tables. We expect to
turn the theoretical ideas into code, solving the difficulties that usually arise
along the way. Finally, we expect to write and publish the main scientific results
of our work in top-level journals.
Task 2 (Responsibility: UPC)
Objectives
The aim of this task is twofold: firstly, to augment τ-ARGUS with an algorithm
for secondary cell suppression based on network flow algorithms; and secondly, to
attempt to improve this methodology with recently developed optimisation methods.
The first objective is not new: it has already been described and applied by other
authors (e.g. Cox). However, τ-ARGUS lacks such a methodology, which has proven to
be especially appropriate for large cell suppression problems. The optimisation
problems that appear are large-scale network flow problems. For big tables, even
the most efficient specialised network simplex algorithms can result in long
execution times. The second objective is directly related to reducing these
execution times by means of recently developed interior-point network flow
algorithms. For big networks, these specialised interior-point methods have been
shown to be more efficient than specialised simplex algorithms. It would also be
worthwhile to study other possible solution strategies. Among these we mention
only algorithms for network flow problems with side constraints, with nonlinear
objective functions, multicommodity flows, and semidefinite programming. In this
workpackage we will focus on multicommodity flow algorithms. Since the research
component of this approach is considerable, the efficiency of its implementation
(and even its availability) may be a difficult goal to achieve.
Description of work
The first goal of this task is to implement an algorithm for secondary cell suppression.
The algorithm would be based on the methodology described by Cox. Additional
refinements of the heuristic procedures suggested for interval cell disclosure
could be attempted. The algorithm would be implemented in C or C++,
and ready to be integrated within τ-ARGUS. This new option of τ-ARGUS should be
able to efficiently provide good solutions for big tables. An external network flow
solver should be used (e.g., Xpress, Cplex 6.0 or PPRN).
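To illustrate the network view underlying this methodology on a two-dimensional
table: rows and columns become nodes and internal cells become edges, so that
protecting a sensitive cell (i, j) amounts to finding a cheap alternative path
between row node i and column node j; together with the sensitive cell, this path
closes a cycle along which the table values can be perturbed. The sketch below
(using networkx for brevity, with zero cost for already-suppressed cells) shows
only this basic idea, not the full method with protection levels and capacities:

    import networkx as nx

    def protect_cell(costs, suppressed, i, j, n_rows, n_cols):
        """Pick secondary suppressions protecting sensitive cell (i, j)."""
        G = nx.Graph()
        for r in range(n_rows):
            for c in range(n_cols):
                if (r, c) == (i, j):
                    continue                 # exclude the sensitive cell itself
                w = 0.0 if (r, c) in suppressed else costs[r][c]
                G.add_edge(("row", r), ("col", c), weight=w, cell=(r, c))
        # Cheapest alternative path from row i to column j.
        path = nx.shortest_path(G, ("row", i), ("col", j), weight="weight")
        for u, v in zip(path, path[1:]):
            suppressed.add(G[u][v]["cell"])  # suppress every cell on the path
        return suppressed

In the real implementation this step is solved as a min-cost network flow problem,
which is where the external solvers mentioned above come in.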
To deal with large tables, a replacement of the network simplex algorithm based on a
specialised interior-point network method will also be developed. This new algorithm
could be selected by the user of τ-ARGUS as a new option of the package.
The interior-point network solver should be more efficient than classical network
simplex codes in very large two-dimensional tables (thousands of columns and/or
thousands of rows). To the best of our knowledge, no implementations of such a methodology
are yet available. The implementation would be developed in C or C++, and it would be
integrated in τ-ARGUS. An external solver could be used for the crossover stage
(i.e., recovering a basic solution from a nonbasic one).
The second goal is to study how to apply new optimisation methodologies to the
higher-dimensional interval cell suppression problem. These methodologies would attempt
to solve a continuous formulation of the cell suppression problem. This problem is
usually formulated as an integer programming problem, guaranteeing its optimal
solution at the expense of computational efficiency. In this task we would attempt
to formulate a continuous (relaxed) problem, which would be solved through a
multicommodity (linear or nonlinear) solver.
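For illustration only (the notation below is ours and not fixed by this document),
the multicommodity structure can arise by letting each sensitive cell k define a
commodity whose flow f^k must ship the required protection d^k through the table
network, the commodities being coupled only through the suppression variables x:

    \begin{aligned}
    \min\ & \textstyle\sum_i w_i x_i \\
    \text{s.t. } & N f^k = d^k && \text{(flow conservation for each sensitive cell } k\text{)}, \\
    & 0 \le f^k_i \le u_i x_i && \text{(flow only through suppressed cells)}, \\
    & x_i \in [0,1] && \text{(the continuous relaxation of } x_i \in \{0,1\}\text{)},
    \end{aligned}

where N is the node-arc incidence matrix of the table network. Solving this
relaxed problem is precisely where a (linear or nonlinear) multicommodity flow
solver would enter.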
Milestones and expected results
First, an implementation of the network flow approach for secondary cell
suppression, integrated in τ-ARGUS; second, a study of the improved approaches
based on interior-point and multicommodity flow methods and, where successful,
their implementation.